Current Issue: April - June | Volume: 2012 | Issue Number: 2 | Articles: 6
In high-quality conferencing systems, it is desirable to perform noise reduction with as little speech distortion as possible. Previous work, based on time-varying amplification controlled by signal-to-noise ratio estimation in different frequency subbands, has shown promising results in this regard but can suffer from problems in situations with intense continuous speech. Further, the amount of noise reduction cannot exceed a certain level without introducing artifacts. This paper establishes these problems and proposes several improvements. The improved algorithm is evaluated with several different noise characteristics, and the results show that, compared with previous work, the algorithm provides even less speech distortion, better performance in a multi-speaker environment, and improved noise suppression when speech is absent...
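The abstract does not give the improved gain rule; as a point of reference, the sketch below shows a generic subband noise reducer with an SNR-controlled, floored Wiener-style gain, where the gain floor models the artifact-driven limit on suppression depth mentioned above. The function name, the noise-floor tracker, and all parameter values are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def subband_noise_reduction(x, frame=512, hop=256, gain_floor=0.1, alpha=0.98):
    """Hypothetical sketch: a per-band SNR estimate drives a floored
    Wiener-style gain; the floor limits suppression depth to avoid
    audible artifacts."""
    win = np.hanning(frame)
    y = np.zeros(len(x))
    wsum = np.zeros(len(x))
    noise_psd = None
    for start in range(0, len(x) - frame + 1, hop):
        seg = x[start:start + frame] * win
        spec = np.fft.rfft(seg)
        psd = np.abs(spec) ** 2
        if noise_psd is None:
            noise_psd = psd.copy()            # initialise from the first frame
        # slow upward, instant downward tracking of the noise floor; a
        # tracker like this inflates during long speech runs, which is the
        # intense-continuous-speech weakness the abstract refers to
        noise_psd = np.minimum(alpha * noise_psd + (1 - alpha) * psd, psd)
        snr = np.maximum(psd / (noise_psd + 1e-12) - 1.0, 0.0)
        gain = np.maximum(snr / (snr + 1.0), gain_floor)  # floored Wiener gain
        y[start:start + frame] += np.fft.irfft(gain * spec, frame) * win
        wsum[start:start + frame] += win ** 2
    return y / np.maximum(wsum, 1e-12)        # overlap-add normalisation
```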
Recently, audio segmentation has attracted research interest because of its usefulness in several applications, such as audio indexing and retrieval, subtitling, and monitoring of acoustic scenes. Moreover, a preceding audio segmentation stage may improve the robustness of speech technologies such as automatic speech recognition and speaker diarization. In this article, we present the evaluation of broadcast news audio segmentation systems carried out in the context of the Albayzín-2010 evaluation campaign. That evaluation consisted of segmenting audio from the 3/24 Catalan TV channel into five acoustic classes: music, speech, speech over music, speech over noise, and other. The evaluation results revealed the difficulty of this segmentation task. In this article, after presenting the database and metric, as well as the feature extraction methods and segmentation techniques used by the submitted systems, the experimental results are analyzed and compared with the aim of gaining insight into the proposed solutions and identifying promising directions...
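For concreteness, here is a minimal sketch of a frame-level error count for a five-class segmentation hypothesis, with an optional forgiveness collar around reference class boundaries. The class list follows the abstract; the metric itself is a simplified stand-in, and the campaign's official time-weighted scoring may differ.

```python
import numpy as np

CLASSES = ["music", "speech", "speech over music", "speech over noise", "other"]

def frame_error_rate(ref, hyp, collar=0):
    """Fraction of frames whose hypothesised class differs from the
    reference, optionally ignoring frames within `collar` frames of a
    reference boundary. A simplified stand-in for the official metric."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    keep = np.ones(len(ref), dtype=bool)
    if collar:
        boundaries = np.flatnonzero(ref[1:] != ref[:-1]) + 1
        for b in boundaries:
            keep[max(0, b - collar):b + collar] = False
    return float(np.mean(ref[keep] != hyp[keep]))
```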
This article proposes a multiscale product (MP)-based method for estimating the open quotient (OQ) from the speech waveform. The MP is computed by calculating the wavelet transform coefficients of the speech signal at three scales and then multiplying them. The resulting MP signal exhibits negative peaks indicating glottal closure and positive peaks indicating glottal opening. Since the shape of the speech MP is close to that of the derivative of the electroglottographic (EGG) signal, we perform a correlation analysis for fundamental frequency and OQ measurement. The approach is validated on voiced parts of the Keele University database by calculating the absolute and relative errors between the OQ estimated from the speech signals and that from the corresponding EGG signals. When considering the mean OQ over each voiced segment, our test results show that OQ is estimated within an absolute error of 0.04 to 0.1 and a relative error of 8 to 21% for all speakers. The approach performs less well for frame-by-frame OQ measurement: the absolute error reaches 0.12 and the relative error 30%...
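A minimal sketch of the multiscale product idea follows: derivative-of-Gaussian filters stand in for the wavelet transform at three dyadic scales (the article's exact wavelet is not reproduced here), and OQ is read off between opening and closure peaks. The scale choices, function names, and peak picking are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def multiscale_product(x, scales=(1, 2, 4)):
    """Product of smoothed-derivative responses at three dyadic scales.
    With an odd number of factors the product keeps the edge sign:
    negative peaks at glottal closure, positive peaks at glottal opening."""
    return np.prod([gaussian_filter1d(x, s, order=1) for s in scales], axis=0)

def estimate_oq(mp, min_period):
    """OQ per glottal cycle: closures are dominant negative MP peaks, the
    opening is the largest positive peak between two closures, and
    OQ = open-phase duration / period. Peak pruning is deliberately crude."""
    neg = [i for i in range(1, len(mp) - 1)
           if mp[i] < mp[i - 1] and mp[i] < mp[i + 1] and mp[i] < 0]
    gci = []
    for i in sorted(neg, key=lambda i: mp[i]):   # strongest closures first
        if all(abs(i - g) >= min_period for g in gci):
            gci.append(i)
    gci.sort()
    return [(b - (a + 1 + int(np.argmax(mp[a + 1:b])))) / (b - a)
            for a, b in zip(gci, gci[1:]) if b - a > 2]
```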
In this article, a novel technique based on the empirical mode decomposition methodology for processing speech features is proposed and investigated. The empirical mode decomposition generalizes Fourier analysis by decomposing a signal into a sum of intrinsic mode functions. In this study, we implement an iterative algorithm to find the intrinsic mode functions of any given signal. We design a novel speech feature post-processing method based on the extracted intrinsic mode functions to achieve noise robustness for automatic speech recognition. Evaluation results on the noisy-digit Aurora 2.0 database show that our method leads to significant performance improvement. The relative improvement over the baseline features increases from 24.0 to 41.1% when the proposed post-processing method is applied to mean-variance normalized speech features. The proposed method also improves over the performance achieved by a very noise-robust frontend when the test speech data are highly mismatched...
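The abstract does not spell out its iterative algorithm; the sketch below shows the standard sifting procedure for extracting intrinsic mode functions, which is the usual way such a decomposition is computed. Envelope construction uses cubic splines with naive boundary handling, and the stopping rules and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def sift(x, n_imfs=4, max_iter=50, tol=1e-3):
    """Standard EMD sifting sketch: each intrinsic mode function is obtained
    by repeatedly subtracting the mean of the upper and lower extrema
    envelopes until that mean is close to zero."""
    residue = np.asarray(x, dtype=float).copy()
    t = np.arange(len(residue))
    imfs = []
    for _ in range(n_imfs):
        h = residue.copy()
        for _ in range(max_iter):
            mx = np.flatnonzero((h[1:-1] > h[:-2]) & (h[1:-1] > h[2:])) + 1
            mn = np.flatnonzero((h[1:-1] < h[:-2]) & (h[1:-1] < h[2:])) + 1
            if len(mx) < 4 or len(mn) < 4:
                break                               # too few extrema to sift
            mean_env = (CubicSpline(mx, h[mx])(t)
                        + CubicSpline(mn, h[mn])(t)) / 2.0
            if np.mean(mean_env ** 2) < tol * np.mean(h ** 2):
                break                               # envelope mean near zero
            h = h - mean_env
        imfs.append(h)
        residue = residue - h
    return imfs, residue
```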
Time delay estimation (TDE) is a fundamental subsystem of speaker localization and tracking systems. Most traditional TDE methods are based on second-order statistics (SOS) under a Gaussian assumption for the source. This article addresses the TDE problem using two information-theoretic measures, joint entropy and mutual information (MI), which can be considered to indirectly include higher-order statistics (HOS). TDE solutions using the two measures are presented for both Gaussian and Laplacian models. We show that, for stationary signals, the two measures are equivalent for TDE. However, for non-stationary signals (e.g., noisy speech signals), maximizing MI gives a more consistent estimate than minimizing joint entropy. Moreover, an existing idea of using modified MI to embed information about reverberation is generalized to the multiple-microphone case. Experimental results for speech signals show that this scheme with the Gaussian model gives the most robust performance in various noisy and reverberant environments...
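To make the Gaussian-model case concrete: for jointly Gaussian signals the mutual information at lag tau is I(tau) = -0.5 log(1 - rho(tau)^2), so maximizing MI over candidate delays reduces to peak-picking the normalized cross-correlation. The sketch below illustrates that reduction; the function name and windowing are ours, and the article's Laplacian-model and reverberation-aware variants are not reproduced.

```python
import numpy as np

def tde_mi_gaussian(x1, x2, max_lag):
    """Pick the lag maximizing I(tau) = -0.5*log(1 - rho(tau)^2), the MI of
    two jointly Gaussian signals; equivalent to maximizing |rho(tau)|."""
    best_tau, best_mi = 0, -np.inf
    for tau in range(-max_lag, max_lag + 1):
        a = x1[max(0, -tau):len(x1) - max(0, tau)]   # x1[n]
        b = x2[max(0, tau):len(x2) - max(0, -tau)]   # x2[n + tau]
        rho = np.corrcoef(a, b)[0, 1]
        mi = -0.5 * np.log(max(1.0 - rho ** 2, 1e-12))
        if mi > best_mi:
            best_tau, best_mi = tau, mi
    return best_tau
```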
The frequency-to-channel mapping for cochlear implant (CI) signal processors was originally designed to optimize speech perception and generally does not preserve the harmonic structure of music sounds. An algorithm aimed at restoring the harmonic relationship of frequency components, based on semitone mapping, is presented in this article. Two semitone (Smt) based mappings in different frequency ranges were investigated. The first, Smt-LF, covers a range from 130 to 1502 Hz, which encompasses the fundamental frequency of most musical instruments. The second, Smt-MF, covers a range from 440 to 5040 Hz, allocating frequency bands of sounds close to their characteristic tonotopic sites according to Greenwood's function. Smt-LF, in contrast, transposes the input frequencies onto locations with higher characteristic frequencies. A sequence of 36 synthetic complex tones (C3 to B5), each consisting of a fundamental and 4 harmonic overtones, was processed using the standard (Std), Smt-LF, and Smt-MF mappings. The analysis of output signals showed that the harmonic structure between overtones of all complex tones was preserved using Smt mapping. Semitone mapping thus preserves the harmonic structure and may in turn improve music representation for Nucleus cochlear implants. The proposed semitone mappings incorporate the use of virtual channels to allow frequencies spanning three and a half octaves to be mapped to 43 stimulation channels. A pitch difference limen test was conducted with normal-hearing subjects discriminating pairs of pure tones separated by different semitone intervals and processed by a vocoder-type simulator of CI sound processing. The results showed better performance with wider semitone intervals. However, no significant difference was found between the 22- and 43-channel maps...
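As an illustration of semitone mapping, the sketch below builds one-semitone-wide band edges on a log-frequency axis, so the ratio between adjacent edges is exactly 2^(1/12) and harmonic intervals between partials map to fixed channel offsets. The range and channel count follow the abstract's Smt-LF example; the edge placement and the frequency-to-channel rounding are simplifications, not the article's exact filter bank.

```python
import numpy as np

def semitone_band_edges(f_low, n_channels):
    """Band edge k sits at f_low * 2**(k/12): one semitone per channel,
    which is what preserves harmonic structure across the map."""
    return f_low * 2.0 ** (np.arange(n_channels + 1) / 12.0)

def freq_to_channel(f, f_low, n_channels):
    """Nearest-semitone channel index, clipped to the valid range."""
    k = int(round(12 * np.log2(f / f_low)))
    return min(max(k, 0), n_channels - 1)

edges = semitone_band_edges(130.0, 43)    # Smt-LF style: 130 Hz, 43 channels
print(freq_to_channel(261.6, 130.0, 43))  # C4 is 12 semitones above ~C3 -> 12
```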